NLG metric
PolyPath: Adapting a Large Multimodal Model for Multi-slide Pathology Report Generation
Faruk Ahmed, Lin Yang, Tiam Jaroensri, Andrew Sellergren, Yossi Matias, Avinatan Hassidim, Greg S. Corrado, Dale R. Webster, Shravya Shetty, Shruthi Prabhakara, Yun Liu, Daniel Golden, Ellery Wulczyn, David F. Steiner
Recent applications of vision-language modeling in digital histopathology have been predominantly designed to generate text describing individual regions of interest extracted from a single digitized histopathology image, or Whole Slide Image (WSI). An emerging line of research approaches the more practical clinical use case of slide-level text generation (Ahmed et al., 2024; Chen et al., 2024). However, in the typical clinical use case, there can be multiple biological tissue parts associated with a case, with each part having multiple slides. Pathologists write up a report summarizing their part-level diagnostic findings by microscopically reviewing each of the available slides per part and integrating information across these slides. This many-to-one relationship of slides to clinical descriptions is a recognized challenge for vision-language modeling in this space (Ahmed et al., 2024). The common approach taken in recent literature is to restrict modeling and analysis to single-slide cases or to manually identify a single slide within a case or part that is most representative of the clinical findings in reports (Ahmed et al., 2024; Chen et al., 2024; Guo et al., 2024; Shaikovski et al., 2024; Xu et al., 2024; Zhou et al., 2024). This strategy of selecting representative slides was also adopted in constructing one of the most widely used histopathology datasets, TCGA (Cooper et al., 2018).
- Research Report > Experimental Study (0.70)
- Research Report > New Finding (0.46)
- Health & Medicine > Therapeutic Area > Oncology (1.00)
- Health & Medicine > Diagnostic Medicine (1.00)
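To make the many-to-one structure described in the PolyPath abstract concrete, the following is a minimal, hypothetical sketch of how a case might be organized for part-level report generation. The data layout and the `generate_part_description` callable are illustrative placeholders, not the PolyPath implementation.

```python
from dataclasses import dataclass, field

@dataclass
class Part:
    part_id: str
    slide_paths: list[str] = field(default_factory=list)  # several WSIs per part

@dataclass
class Case:
    case_id: str
    parts: list[Part] = field(default_factory=list)  # several parts per case

def generate_report(case: Case, generate_part_description) -> str:
    """One description per part, produced from *all* slides of that part."""
    lines = []
    for part in case.parts:
        # Hypothetical multimodal call that consumes every slide of the part
        # jointly; this is a placeholder, not the PolyPath API.
        text = generate_part_description(part.slide_paths)
        lines.append(f"Part {part.part_id}: {text}")
    return "\n".join(lines)
```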
Can We Trust the Performance Evaluation of Uncertainty Estimation Methods in Text Summarization?
Jianfeng He, Runing Yang, Linlin Yu, Changbin Li, Ruoxi Jia, Feng Chen, Ming Jin, Chang-Tien Lu
Text summarization, a key natural language generation (NLG) task, is vital in various domains. However, the high cost of inaccurate summaries in risk-critical applications, particularly those involving human-in-the-loop decision-making, raises concerns about the reliability of uncertainty estimation on text summarization (UE-TS) evaluation methods. This concern stems from the dependency of uncertainty model metrics on diverse and potentially conflicting NLG metrics. To address this issue, we introduce a comprehensive UE-TS benchmark incorporating 31 NLG metrics across four dimensions. The benchmark evaluates the uncertainty estimation capabilities of two large language models and one pre-trained language model on three datasets, with human-annotation analysis incorporated where applicable. We also assess the performance of 14 common uncertainty estimation methods within this benchmark. Our findings emphasize the importance of considering multiple uncorrelated NLG metrics and diverse uncertainty estimation methods to ensure reliable and efficient evaluation of UE-TS techniques.
- North America > United States > Virginia > Falls Church (0.04)
- North America > United States > Texas > Dallas County > Richardson (0.04)
- North America > Canada > British Columbia > Metro Vancouver Regional District > Vancouver (0.04)
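As a rough illustration of the evaluation concern raised in the abstract above: a UE-TS method is typically judged by how well its per-summary uncertainty scores track NLG quality metrics, and different metrics can disagree. The sketch below, with made-up scores and illustrative metric names, shows one plausible way to check this per metric; it is not the benchmark's actual protocol.

```python
from scipy.stats import spearmanr

# One uncertainty score per generated summary (higher = less confident).
uncertainty_scores = [0.8, 0.2, 0.5, 0.9]

# Quality scores for the same summaries under several NLG metrics (toy values).
metric_scores = {
    "rouge_l": [0.31, 0.62, 0.45, 0.28],
    "bertscore": [0.85, 0.93, 0.90, 0.82],
    "bartscore": [-2.1, -1.2, -1.6, -2.4],
}

# Reliable uncertainty should rank low-quality summaries as more uncertain,
# so we expect a negative rank correlation with each quality metric.
for name, quality in metric_scores.items():
    rho, _ = spearmanr(uncertainty_scores, quality)
    print(f"{name}: Spearman rho = {rho:.2f}")
```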
Is ChatGPT a Good NLG Evaluator? A Preliminary Study
Jiaan Wang, Yunlong Liang, Fandong Meng, Zengkui Sun, Haoxiang Shi, Zhixu Li, Jinan Xu, Jianfeng Qu, Jie Zhou
Recently, the emergence of ChatGPT has attracted wide attention from the computational linguistics community. Many prior studies have shown that ChatGPT achieves remarkable performance on various NLP tasks in terms of automatic evaluation metrics. However, the ability of ChatGPT to serve as an evaluation metric is still underexplored. Considering that assessing the quality of natural language generation (NLG) models is an arduous task and that NLG metrics notoriously correlate poorly with human judgments, we ask whether ChatGPT is a good NLG evaluation metric. In this report, we provide a preliminary meta-evaluation of ChatGPT to assess its reliability as an NLG metric. In detail, we regard ChatGPT as a human evaluator and give task-specific (e.g., summarization) and aspect-specific (e.g., relevance) instructions to prompt ChatGPT to evaluate the outputs of NLG models. We conduct experiments on five NLG meta-evaluation datasets (covering summarization, story generation, and data-to-text tasks). Experimental results show that, compared with previous automatic metrics, ChatGPT achieves state-of-the-art or competitive correlation with human judgments in most cases. In addition, we find that the effectiveness of the ChatGPT evaluator may depend on how the meta-evaluation datasets were created: for datasets whose construction relies heavily on the references, and which are therefore biased, the ChatGPT evaluator may lose its effectiveness. We hope our preliminary study can prompt the emergence of a general-purpose, reliable NLG metric.
- North America > United States > Minnesota > Hennepin County > Minneapolis (0.14)
- Asia > Japan > Honshū > Kantō > Tokyo Metropolis Prefecture > Tokyo (0.14)
- Asia > China > Beijing > Beijing (0.04)
- (14 more...)
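The sketch below illustrates the general recipe the abstract describes: prompt an LLM with a task-specific and aspect-specific instruction to score each output, then meta-evaluate by correlating those scores with human judgments. The prompt wording and the `score_fn` hook are hypothetical stand-ins, not the paper's exact setup.

```python
from scipy.stats import spearmanr

def build_prompt(source: str, summary: str, aspect: str = "relevance") -> str:
    # Task-specific (summarization) and aspect-specific (relevance) instruction.
    return (
        f"Score the following summary for {aspect} on a scale from 1 to 5.\n\n"
        f"Source article:\n{source}\n\nSummary:\n{summary}\n\nScore:"
    )

def meta_evaluate(sources, summaries, human_scores, score_fn):
    # score_fn(prompt) -> float should call the LLM of your choice and parse
    # the numeric rating from its reply (placeholder, not the paper's code).
    model_scores = [score_fn(build_prompt(s, y)) for s, y in zip(sources, summaries)]
    # Meta-evaluation: rank correlation between LLM scores and human judgments.
    rho, _ = spearmanr(model_scores, human_scores)
    return rho
```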
Improving the Factual Correctness of Radiology Report Generation with Semantic Rewards
Jean-Benoit Delbrouck, Pierre Chambon, Christian Bluethgen, Emily Tsai, Omar Almusa, Curtis P. Langlotz
Neural image-to-text radiology report generation systems offer the potential to improve radiology reporting by reducing the repetitive process of report drafting and identifying possible medical errors. These systems have achieved promising performance as measured by widely used NLG metrics such as BLEU and CIDEr. However, current systems face important limitations. First, they introduce increased architectural complexity that offers only marginal improvements on NLG metrics. Second, systems that achieve high performance on these metrics are not always factually complete or consistent, owing to inadequacies in both training and evaluation. Recent studies have shown that these systems can be substantially improved by new methods that encourage 1) generating domain entities consistent with the reference and 2) describing these entities in inferentially consistent ways. So far, these methods rely on weakly supervised (rule-based) approaches and named entity recognition systems that are not specific to the chest X-ray domain. To overcome this limitation, we propose a new method, the RadGraph reward, to further improve the factual completeness and correctness of generated radiology reports. More precisely, we leverage the RadGraph dataset, which contains chest X-ray reports annotated with entities and relations between entities. On two open radiology report datasets, our system substantially improves scores, by up to 14.2% and 25.3%, on metrics evaluating the factual correctness and completeness of reports.
- North America > United States > Minnesota > Hennepin County > Minneapolis (0.14)
- North America > United States > Indiana (0.04)
- Oceania > Australia > Victoria > Melbourne (0.04)
- (2 more...)
- Health & Medicine > Nuclear Medicine (1.00)
- Health & Medicine > Diagnostic Medicine > Imaging (1.00)
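A simplified sketch of the idea behind an entity/relation-based reward such as the RadGraph reward described above: extract entities and relations from the generated and reference reports and reward their overlap. The extraction step is stubbed out with toy sets here; the actual method uses a parser trained on RadGraph annotations and a specific reward formulation that may differ from this F1-style combination.

```python
def overlap_f1(generated: set, reference: set) -> float:
    """F1 between the generated and reference sets (entities or relations)."""
    if not generated or not reference:
        return 0.0
    tp = len(generated & reference)
    if tp == 0:
        return 0.0
    precision = tp / len(generated)
    recall = tp / len(reference)
    return 2 * precision * recall / (precision + recall)

# Toy example: (entity, label) pairs and (head, relation, tail) triples that a
# RadGraph-style parser might extract from the generated and reference reports.
gen_entities = {("effusion", "observation"), ("left lung", "anatomy")}
ref_entities = {("effusion", "observation"), ("right lung", "anatomy")}
gen_relations = {("effusion", "located_at", "left lung")}
ref_relations = {("effusion", "located_at", "right lung")}

# Combine entity-level and relation-level agreement into a single reward.
reward = 0.5 * overlap_f1(gen_entities, ref_entities) + \
         0.5 * overlap_f1(gen_relations, ref_relations)
print(f"RadGraph-style reward: {reward:.2f}")
```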
Explaining Chest X-ray Pathologies in Natural Language
Maxime Kayser, Cornelius Emde, Oana-Maria Camburu, Guy Parsons, Bartlomiej Papiez, Thomas Lukasiewicz
Most deep learning algorithms lack explanations for their predictions, which limits their deployment in clinical practice. Approaches to improve explainability, especially in medical imaging, have often been shown to convey limited information, be overly reassuring, or lack robustness. In this work, we introduce the task of generating natural language explanations (NLEs) to justify predictions made on medical images. NLEs are human-friendly and comprehensive, and enable the training of intrinsically explainable models. To this end, we introduce MIMIC-NLE, the first large-scale medical imaging dataset with NLEs. It contains over 38,000 NLEs, which explain the presence of various thoracic pathologies and chest X-ray findings. We propose a general approach to solve the task and evaluate several architectures on this dataset, including via clinician assessment.
- Europe > United Kingdom > England > Oxfordshire > Oxford (0.14)
- Europe > United Kingdom > England > Greater London > London (0.04)
- Europe > Austria (0.04)
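For orientation, an NLE-style example in a dataset like the one described above pairs an image and its findings with a free-text explanation. The field names below are hypothetical and do not reflect MIMIC-NLE's actual schema.

```python
from dataclasses import dataclass

@dataclass
class NLEExample:
    image_path: str      # chest X-ray image
    findings: list[str]  # e.g., ["pleural effusion", "atelectasis"]
    explanation: str     # natural language explanation justifying the findings

example = NLEExample(
    image_path="images/patient_001_view1.png",
    findings=["pleural effusion"],
    explanation="Blunting of the left costophrenic angle suggests a small pleural effusion.",
)
```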
Jury: Evaluating performance of NLG models
Jury is an evaluation package for NLG systems that allows computing many metrics in one go. It implements concurrency across evaluation metrics and supports evaluation with multiple predictions. Jury uses the datasets package for its metrics, and thus supports any metric that the datasets package provides. The default evaluation metrics are BLEU, METEOR, and ROUGE-L.
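A minimal usage sketch based on the description above; the exact API (the `Jury` class and its call signature) should be checked against the package documentation, as it may differ across versions.

```python
from jury import Jury

predictions = ["the cat sat on the mat", "it is a sunny day"]
references = ["the cat is sitting on the mat", "today is a sunny day"]

# With no arguments, Jury falls back to its default metrics
# (BLEU, METEOR, and ROUGE-L, per the description above).
scorer = Jury()
scores = scorer(predictions=predictions, references=references)
print(scores)
```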